Travel pattern-based bus trip origin-destination estimation using smart card data

Smart card data are widely used in generating the origin and destination (O–D) matrix for public transit, which contains important information for transportation planning and operation. However, the generation of the O–D matrix is limited by the smart card data information that includes the boarding (origin) information without the alighting (destination) information. To solve this problem, trip chain methods have been proposed, thereby greatly contributing in estimating the destination using the smart card data. Nevertheless, unlinked trips, that is, trips with unknown destinations, are a persisting issue. The purpose of this study is to develop a method for estimating the destination of unlinked trips, in which trip chain methods cannot be applied, using temporal travel patterns and historical boarding records of the passengers based on long-term smart card data. The passengers were clustered by k-means clustering, and the time-of-day travel patterns were estimated for each cluster using a Gaussian mixture model. The travel patterns were formulated to estimate the destination of the passengers from the smart card data. The proposed method was verified using the 2018 smart card data collected in Sejong City, South Korea. The existing trip chain method matched the destinations of 60.0% of the total trips, whereas the proposed method improved the matching to 74.9% by additionally matching the destinations of 37.2% of the unlinked trips.


Introduction
Smart card data contain the usage and operation records of public transportation, and are comprehensive and widely used alternatives to the sampling data from public transportation surveys. Several studies have been conducted on the analysis of public transportation status using smart card data. Moreover, further studies, such as destination estimation using data mining processes, have been actively proposed. Previous studies have noted the advantages of smart card data for public transportation, as follows: next trip (or the first trip of the following day) and considering the allowable walking distance W r (Fig 1A). However, if there are no links, such as the last trip in Fig 1B or first trip in Fig 1C, the alighting stops cannot be estimated. In addition, in the case of a single trip, as shown in Fig  1D, the alighting stop cannot be estimated. This study developed a method for estimating the destination of unlinked trips (as shown in Fig 1B-1D) by incorporating the temporal travel patterns of public transit use obtained by longitudinally utilizing several days of smart card data. We clustered the passengers based on their historical boarding records and applied a mixture model to each cluster to generate travel patterns to estimate the destinations. Unlike the current trip chain methods that use the next and first trips of the day (or the following day) as the references for the estimation [4,5], the proposed method used historical travel records collected during several days as the references to estimate the destinations of unlinked trips.
This paper is organized as follows: the subsequent section presents a review of related studies on destination estimation and travel pattern analysis using smart card data. Next, the methodologies for generating the travel patterns and estimating the destinations are described. Subsequently, the verification of the developed model using the smart card data of Sejong City in Korea is presented. The improvement in the destination estimation of unlinked trips, which is a problem with the existing trip chain method, is evaluated. Finally, the conclusions and suggestions for future work are presented.

Literature review
O-D estimation is an important process for estimating the trip generation of each O-D pair in transportation planning and operation. Several estimation methods have been developed for O-D matrix generation. The trip chain method has been applied for the O-D estimation process of each passenger using smart card data [4][5][6]. However, the estimation of unlinked trips is a challenge when generating a full O-D matrix. This section focuses on previous studies related to trip destination estimation and passenger clustering using smart card data.

O-D estimation
The trip chain method, which is a representative method for estimating the trip destinations using smart card data, was first proposed by Barry et al. [4]. This method arranges the trips of each passenger with respect to the boarding time. If there is a stop within a certain distance from the next boarding stop among the stops constituting the current boarding route, the stop is estimated as a destination. Two assumptions were made to estimate the destination after the construction of the trip chain: 1) the boarding location of most passengers coincides with the alighting location of the previous trip, and 2) the destination of the last trip of the day is the first boarding location of the daily trip for most of the passengers [4]. The concept of maximum walking distance has been introduced to the trip chain method [5,[7][8][9]. The maximum walking distance refers to the threshold of the distance between the alighting location of the current trip and boarding location of the next trip; distances of 400 m to 2 km have been suggested [5,[8][9][10]. Consequently, a method for chaining trips using the transfer time has been proposed. Consecutive boarding trips of 30-90 min were linked to a trip chain [11][12][13].
In addition, trips on other days were referred to sample cases, wherein it was impossible to construct the trip chains within one day [5]. The trip chain method cannot be applied to a single trip because of the lack of linkage to other trips. Thus, the generalized time was introduced to improve the accuracy of the estimation and to estimate a stop, which considered the destination, based on the minimum generalized travel cost (sum of the in-vehicle travel time and transfer walking time) among the potential alighting stops [6]. The in-vehicle time was calculated based on the bus schedule at the boarding stop to improve the accuracy of the generalized time [14]. The changes in the estimation accuracy according to the change in the maximum walking distance and transfer time (threshold of the transfer decision) were presented [15]. The estimation accuracy of the trip chain method was measured using the smart card data, including the alighting information, wherein the optimal and most accurate maximum walking distance was found to be 500 m [16]. In addition to the trip chain method, some studies have applied learning-based methods for destination estimation. The temporal and spatial alighting probability of each stop was calculated using historical travel records to estimate the destination of unlinked trips [17]. A deep learning model was developed to estimate the destinations using smart card and land use data [18]. In addition, a deep learning approach was applied to generate personalized travel patterns and predict successive points of interest [19].
There are two types of conventional trip chain methods: quantitative methods, which increase the matching percentage by adding rules (or assumptions) for the chain linking, and qualitative methods, which improve the accuracy of the estimation. Recent studies have estimated the O-D matrix using new methods, such as applying a probability model to the trip chain method or completely reorganizing the methodology with a deep learning model. This paper proposes a method for estimating the destination of unlinked trips using smart card data, which increases the matching percentage of the trips using historical trip records from a quantitative point of view and improves the estimation accuracy using the temporal travel characteristics of the passengers from a qualitative point of view.

Passenger clustering
Recently, several studies have used smart card data to understand the travel patterns and behaviors of passengers [20][21][22][23][24]. This study also analyzes the travel pattern of each transit passenger to estimate their trip destination. As it is difficult to practically plan the public transportation operations according to the travel characteristics of each passenger, it is reasonable to group the passengers with similar travel characteristics and use them for planning and operation. The existing studies that analyzed the travel characteristics by clustering the temporal travel characteristics of the passengers are reviewed below.
In previous studies, partitioning methods represented by k-means are mainly used to cluster passengers using smart card data [13,20,25]. Some studies have applied hierarchical clustering methods [26], whereas others have applied the k-means method in combination with other clustering methods. After using k-means, the travel pattern was analyzed by applying the hierarchical ascendant classification [25]. Alternatively, a complex method of analyzing the travel regularity with the k-means and a clustering method based on DBSCAN for recognizing patterns were applied [13].
Several studies have approached clustering using model-based methods. Model-based clustering is employed to recover the original model from a dataset [27], which is suitable for modeling and analyzing data characteristics, such as travel patterns. Mixture models are mainly used as a model-based clustering method, where passengers with similar temporal travel patterns are clustered by a unigram mixture model or Gaussian mixture model (GMM) [3,[28][29][30]. Furthermore, a model combining k-means and GMM was applied to simultaneously cluster passengers and trips within each cluster [3,29,30]. This study applies clustering methods based on the temporal travel characteristics necessary for estimating the destination of the passengers and classifies the travel types using smart card data.

Methodology
Under the premise that adequate historical travel records (boarding location and time) are available, the following were assumed for estimating the destinations of unlinked trips. First, it was assumed that a passenger's destination is near a stop where he/she has frequently boarded. Second, it was assumed that the destination is dependent on the departure (boarding) time. Based on these two assumptions in this study, the destination of a passenger was defined as any one stop (or nearby) that is related to the boarding time of the trip among the stops from where the passenger has boarded with high frequency in the past. For example, for a passenger with a general commuting pattern, if a trip that occurred in the morning is regarded as a work-bound trip, there is a high probability that the destination is near the location where the passenger has boarded frequently in the evening, that is, it is inferred as the location of the office. Likewise, if a trip that occurred in the evening is regarded as a home-bound trip, it is considered that the destination is likely near the location where the passenger boarded frequently in the morning, that is, it is inferred as the location of the home. This study extended the time range for searching the reference trips compared to conventional trip chain methods, in which the reference trip is determined in the order of the next trip, first trip of the day, and first trip of the following day for estimating the destination.
To estimate the destination of a trip based on these assumptions, it is necessary to first analyze the characteristics of a passenger's boarding time, which are useful for estimating the destination. In this paper, "trip pattern" is defined as "the probability of trip generation at the boarding time of the day," and density function (distribution) of the probability of trip generation. The method of generating the trip patterns of the passengers consists of two steps: generating the travel profiles and extracting the travel patterns.

Travel profile generation
Because public transit is utilized only a limited number of times during a day, we attempted to analyze the travel characteristics by accumulating the usage records for several days longitudinally. It is necessary to set an appropriate time interval to count the trip frequency because the boarding time is recorded in seconds. That is, the smaller the time interval is, the lower the probability of counting the trips for each time bin is. In contrast, as the size of the time interval increases, the probability of aggregation increases but the expression of the travel characteristics may be oversimplified. El Mahrsi et al. [3] measured the distribution of the variability of the number of trips observed in stations for time bins ranging from 1 min to 12 h. They observed that the variability decreases as the size of the time bin is increased. Meanwhile the mobility patterns become more predictable and apparent, and the variability decreases significantly when increasing the size of the time bin from 1 min to 1 h. Thereafter, the decrease becomes less pronounced. Consequently, they considered 1-h time bins based on their apparentness and relevance of the mobility patterns [3], which was also considered in our study. We generated personal travel profiles by counting the trip records within the analysis period at time intervals of an hour for each passenger. Considering the operating hours of public transportation, the time range was set from 5 am to 11 pm. Accordingly, the travel profile of each passenger was expressed as a vector (c i ) composed of 19 components, as shown in Eq (1). As the number of trips varies for each passenger, the travel profile was generalized by modifying it in the form of the ratio of the number of trips within the time bin to the total number of trips.
where c i is the travel profile of passenger i, c b i is the total number (count) of boarding trips in time bin b of passenger i during the analysis period, B is the number of time bins (19 components), u i is the travel profile of passenger i, and p b i is the probability of the trip occurrence in time bin b for the total number of trips of passenger i. Fig 2 illustrates the travel profiles of each passenger. When generating a travel profile, it is necessary to identify the transfer trips. To prevent duplicating the number of trips, only the representative trip (first trips) among the single-purpose trips linked by the transfers should be used to generate the travel profile. For example, a passenger departs at 8:10 a.m., makes the first transfer at 8:30 a.m., and makes the second transfer at 8:50 a.m. In this example, if the transfer trips are not identified, the number of trips between 8:00 a.m. and 9:00 a.m. is counted as three. By identifying the transfer trip, the number of trips can be corrected to one trip. Thus, we identified the transfer trips based on temporal and spatial assumptions. If the difference between the boarding time of the previous trip and current trip is within the allowable time (t F ), and the distance between the alighting point of the previous trip and boarding point of the current trip is within the allowable walking distance (W r ), the current trip is determined to be a transfer trip. In this study, we assumed the thresholds for the transfer trips considering the size and structure of the city, that is, t F was set to 1 h, and W r of 500 m was applied, as proposed by Kim and Lee [16]. However, the allowable walking distance cannot be applied to the identification of the transfer trip because the smart card data do not have information on the alighting point. As an alternative, if there is a stop whose distance from the boarding point of the current trip is within the allowable walking distance among all the stops on the previous route, the current trip is considered to meet the spatial transfer criterion. The formula is as follows: Assume "T i;jþ1 " as a transfer trip of "T i;j " where T i,j is the passenger i's j-th trip; O i,j is the boarding point of T i,j ; x i,j is the boarding time of T i,j ; R i,j is the route of T i,j ; z is a potential alighting point of T i,j (z is a stop along the route of T i,j and occurs after O i,j ); and d E (a, b) is the Euclidean distance between locations a and b. In this study, the Euclidean distance was applied to the calculation of the walking distances because of the limited pedestrian network data. However, the use of the available pedestrian network data allows a more precise calculation of the transfer distance. Based on the viewpoint that a travel pattern is generated by the repetition of trips, it is necessary to set a minimum condition for the number of days the public transportation is utilized during the analysis period. We generated the travel profiles for those who used public transportation for more than four days, which was approximately 10% of the analysis period of 38 days. Therefore, passengers who used public transportation for less than four days during the analysis period were excluded from the travel profile and pattern generation owing to insufficient data for extracting their travel patterns. As counting the trips based on the time bins may cause a boundary problem, it is necessary to use the data from several days to mitigate this issue.

Travel pattern extraction
As the next step in generating the travel profile, the generated individual travel profiles were clustered into K clusters according to the similarity in the travel patterns. Briand et al. [30] clustered the travel profiles using GMM. We modified the model to perform behavior analysis of public transit usage. Briand et al. [30] extracted the travel patterns, that is, the density functions of the boarding time, using a two-step generative model. First, the travel profiles were clustered into K clusters, and then the distribution of each cluster was estimated using a GMM. The estimated distribution of each cluster is a polynomial distribution in the form of a mixture of H Gaussian distributions. Similarly, this study employed two steps: the clustering of the profiles and GMM estimation. However, the method of applying the solution of the algorithm was different. Briand et al. [30] fixed both the number of clusters and number of Gaussians before applying the solution; thus, all clusters had the same number of Gaussians. As the number of Gaussians, which is the distribution of the concentrated boarding time, is also a characteristic of each cluster, we modified the method so that the number of Gaussians was different for each cluster. To apply this concept, we first clustered the travel profiles. Subsequently, the GMM was estimated for each cluster. Fig 3 shows the process of generating the travel profiles and patterns through clustering and GMM estimation. The first step (clustering) was used to generate macro-travel patterns using the aggregated data by time bins, whereas the second step (GMM estimation) was applied to estimate the boarding time distributions of each cluster using the original time data. This study applied the partitioned clustering method by considering the data characteristics of the travel profile set, which contains large and high-dimensional variable data. As a similarity measurement method, the Euclidean distance, which is the most commonly used approach for clustering time series data [27], was applied. k-means, k-median, and k-medoid meet these clustering algorithm and similarity measurement conditions. As the computation time is proportional to the square of the number of samples, k-median and k-medoid are not suitable for application to large-scale data; therefore, k-means was chosen as the clustering method.
Clustering by k-means is suitable for large-scale and high-dimensional data. In this study, the number of clusters was determined using the elbow method, which is a heuristic method. Considering the homogeneity of the travel profile within the cluster, the total sum of squared errors (SSE) for each cluster according to the increase in the number of clusters was measured, and the point (elbow) at which the change in this value slowed down was chosen as the number of clusters. Because the clustering object comprised the travel profiles of the passengers, the cluster to which each passenger belonged was determined to be the output of the clustering. Therefore, the cluster numbers k i , k i 2{1, 2, � � �, K}, were added to each passenger and each trip as follows: The GMM can be expressed as follows: Eq (3) suggests that trip j of passenger i belonging to cluster k i follows a polynomial distribution consisting of a mixture of h, h2{1,� � �,H k } distributions. Z i,j is a latent variable, and the parameter τ k,h is the proportion of each distribution. Eq (4) indicates that the boarding time x i, j of trip j of passenger i follows one of the H k Gaussian distributions constituting cluster k i to which passenger i belongs. The mean and standard deviation of each Gaussian distribution are μ k,h and σ k,h , respectively. The parameters were estimated using the expectation maximization algorithm, which is a traditional solution of the mixture model. The Gaussian mixture distributions were estimated for each cluster, and the Gaussian distribution number h to which the trip belonged was added to each trip.
H k , which is the number of Gaussians for each cluster, is also a parameter that must be set like the number of clusters K. We estimated the Gaussian mixture distribution of the boarding times of the trips belonging to the cluster by changing H k in the range of 1 to N for each cluster. The Gaussian number H k was chosen according to the integrated completed likelihood criterion. Through the GMM estimation, the distribution of the boarding times of the passengers belonging to each cluster was estimated, as shown in the examples in Figs 4 and 5.
As a result, the travel pattern can be represented by H k sets of Gaussian parameters (τ k,h , μ k, h , σ k,h ). Given the boarding time x i,j of a certain trip, the probability of that trip for each Gaussian distribution can be calculated using the Gaussian parameters and probability equation of the Gaussian distribution. The most probable distribution is assigned to h i,j , which is the Gaussian distribution to which the trip belongs.

Destination estimation
The trip chain method estimates the destination of a trip by referring to other trips of that passenger on the same day or on nearby dates. Unlike the traditional trip chain method, our proposed destination estimation method estimates the destination of a trip by referring to trip records and travel patterns of the past days, months, or years. This section describes the process of referring to historical trip records for the destination estimation.  passengers belong was assigned to the passengers who satisfy the condition for extracting the travel patterns. In this study, the condition was that the passenger must have used public transit for more than a total of four days. As mentioned at the beginning of this section, it was assumed that the destination of a trip depends on the departure (boarding) time. By applying this assumption, we assumed that the destination of a trip depends on the time section of the boarding time to which the trip belonged. In addition, according to the basic assumption of this study, the locations from where a passenger frequently boards can be potential destinations for that passenger. However, if the historical boarding records in the same time section are referred to as the destinations for trips boarded in a time section, a problem may arise where the origin and destination are the same location. Therefore, the destination of a trip in a time section (h) was estimated by referring to the historical boarding records in other time sections (h � , h � 6 ¼h), i.e., excluding the analyzed time section.
He and Trépanier [17] stored the historical records of the passengers' alighting points on a trip route as potential destinations. They developed the density functions of the alighting probability according to the alighting frequency and time concentration for each potential destination, and estimated the stop with the highest probability as the destination. We examined the use of historical records of potential destinations by He and Trépanier [17], and employed this concept to estimate the destination based on previously described travel patterns. However, this paper proposes a method using the historical boarding records without the historical alighting records. Using the route information of the trip to be estimated, the nearby stops and boarding frequency at each stop were stored for each of the stops along route m. We applied W r of 500 m, as suggested by Kim and Lee [16], as the nearby radius. Fig 6 shows the concept of generating the travel records for each passenger.
To correlate the travel patterns with the historical boarding records, the latter were collected for each Gaussian distribution number h corresponding to each time section. As shown in Fig  7, if a passenger follows a traffic pattern of a mixture of two Gaussians, a boarding record set was generated for each h = 1 and h = 2.
After the set of historical boarding records for each time section was generated for each potential stop, p h ðS q m Þ, which is the alighting probability in the time section h of the potential stop S q m , was calculated using the historical boarding frequencies of each stop, where q, q2 {1,� � �,Q m }, is the stop sequence number, and s, s2{1,� � �,S}, is the ID number of each stop. where h � is a time section other than time section h; q b is the stop sequence of the boarding stop of trip T i,j , (q�q b +1); D s q is a dummy variable indicating whether S q m and B s i are nearby (1: . The stop with the highest probability of alighting was estimated as the destination. When there were several stops with the same probability, the stop with the smallest stop sequence was identified as the destination to prevent choosing an incorrect stop. In the case of two time sections, as shown in Fig 4, the destinations were estimated by cross-referencing. Meanwhile, for three or more time sections, as shown in Fig 5, the priority of the reference should be determined. Considering that Trépanier et al. [5] estimated the destination by referring to the first trip of the day for analyzing the last trip of the day, we estimated the destinations by primarily referring to the historical boarding records in the first time section (h = 1) for the trips in the second or later sections (h�2). As self-referencing was not possible for the trips in the first time section, the time section with the largest τ k,h value, which is the proportion of each time section (distribution) estimated in the travel pattern generation step to the second or later time sections, was primarily referred. After the primary reference, the trips of that section were referenced in the order of the time section with the largest τ k,h value. For example, in the case of τ k,h = 1 >τ k,h = 3 >τ k,h = 2 in a travel pattern that is a mixture of three distributions, the trips in h = 3 and h = 2 refer primarily to the historical boarding records in the first time section (h = 1). The trips in h = 1 refer primarily to the historical boarding records at h = 3, which is the section with the largest τ k,h value. For h = 1, h = 2, and h = 3, for the trips whose destinations could not be estimated in the primary reference, the historical boarding records in the remaining time sections, i.e., h = 2, h = 3, and h = 2, respectively, were referred.

Data description
We verified the proposed method using the smart card data from Sejong City, which is an administrative capital city in the Republic of Korea with a population of approximately 300,000 and area of 464.9 km 2 . The data were provided by the Ministry of Land, Infrastructure and Transport (the Republic of Korea) in October 2018 for transportation-related analysis and research. The card numbers were encrypted to prevent the identification of the individual travelers. Smart card data for 42 weekdays in the period April 1 and May 31, 2018 were used. During the analysis period, 128 bus routes and 1,293 bus stops were in operation in Sejong City. The data for 38 days were used to generate the travel patterns as historical boarding records, and data for four days (from May 28 to May 31) were used for validation. The smart card data of Sejong City consist of the transaction and operation information data. The transaction data include public transport usage records, such as card number (encrypted for privacy), transportation modes, route number, boarding stop, and boarding time. The operation information data include the route information, stop information (name and coordinates), and stop configuration for each route (set and sequence number of stops constituting the route). During the analysis period, a total of 246,516 passengers used public transportation. The data consisted of 2,253,840 bus transactions. There were 1,733,625 transaction records with the destination (alighting stop) data from the smart card data, which constituted 76.9% of the total transactions. To estimate the destinations of the transaction records using the developed method, we removed the destination data from the smart card data and re-inferred the destinations based on the boarding stop data. The destination data from the original data were only used to verify the estimation accuracy.

Travel profile and travel pattern generation results
The original data comprised 1,614,887 trip records generated by 194,730 passengers, and 1,595,470 trips remained when the transfer trips were removed. Because the travel profiles comprise data for generating travel patterns according to the repetitions of trips, the travel profiles were generated only for passengers who had used public transportation for at least four days, which was approximately 10% of the analysis period. Consequently, we generated the travel profiles using 1,331,025 (82.4% of original data) trip records generated by 63,014 (32.4% of the original data) passengers.
Before clustering the travel profiles, the number of clusters K and number of Gaussians H k for each cluster should be determined. The number of clusters K was determined using the elbow method, as described above. Fig 8 shows the measurement result of the total SSE of each cluster when the number of clusters K was increased from 1 to 100. Clustering was considered suitable if the sum of the SSEs of each cluster decreases. When there were 20 clusters, an elbow where the error improvement efficiency clearly decreased was observed. Consequently, we  (τ k,h , μ k,h , σ k,h ) constituting each Gaussian are important in the quantitative analysis of the travel patterns. Hence, each travel pattern can be quantitatively expressed as a set of parameters, as listed in Table 1. Taking cluster 7 as an example, a trip belonging to cluster 7 has a 22% probability of belonging to a distribution centered at 8.46 hours (approximately 8:30), whereas the probability of belonging to a distribution centered at 21.52 hours (approximately 21:30) is 40%, and the probability of belonging to the rest of the distribution is 38%. Two peaks of the three Gaussian distributions are distinctly observed in the figure, but one of them has a relatively low and broad distribution.

Destination estimation result
Because the destination estimation method proposed in this paper refers to the historical boarding records of time sections (Gaussians) other than the time section (Gaussian) to which the trip belongs, the destinations of the trips belonging to clusters with only one time section (H k = 1) cannot be estimated. Therefore, the destination estimation method was applied to 18 clusters, except for clusters 2 and 16, which had travel patterns composed of only one time section. In addition, the proposed method cannot be applied to passengers who do not have a travel pattern owing to their low frequency of travel (defined as those who have used public transportation for less than four days in this study). The validation data comprised a total of 116,194 transactions, including the alighting information, and no logical errors in the sequence between the boarding and alighting stops were noted. These transactions can be used to verify the estimation accuracy.
The proposed destination estimation approach estimated 44.0% of the destinations. Compared with the actual alighting point, the estimation accuracy of the number of trips was 48.4%, and that of the total number of trips was 21.3% (= 44.0% × 48.4%). Because this method refers to the boarding records within the allowable walking distance (500 m) from the potential destination, the initial distance error of the method was 500 m. Therefore, one of the adjacent stops within 500 m could have been incorrectly estimated as the destination stop. Applying the "relaxed rule" to the accuracy metric, that is, the estimated alighting stop is within one stop difference (including previous and next stop) from the actual alighting stop, the accuracy of the estimated number of trips was 76.7%, and the accuracy of the total number of trips was 33.8%. Fig 10 shows the estimation accuracy for each cluster. The accuracy metric by the "normal rule" indicates that the estimated and actual alighting stops were the same. Table 1 summarizes the Gaussian mixture parameters for each cluster. In Table 1, τ k,h represents the proportion of trips belonging to each time section, and μ k,h represents the time concentration for a specific time section. The estimation accuracies of clusters 10 and 18 were relatively higher than those of the other clusters, and both clusters were characterized by high proportions of morning Table 1. Gaussian mixture parameters of the travel pattern for each cluster. Note: The parameter τ kh is the proportion of each distribution, the parameter μ kh is the mean (Gaussian center) of each distribution, and the parameter σ kh is the standard deviation of each distribution.  )   3  1  2  4  3  3  3  2  3  3  3  3  3  3  3  1  3  2 peak hour trips (h = 1), as shown in Table 1 (57.4% for cluster 10 and 51.2% for cluster 18). Therefore, our method was suitable for passengers with high concentration of trips in the morning peak hour. Fig 11 shows the estimation accuracy according to the boarding time. The boarding times of the passengers were divided into four groups according to the range of the fluctuations of the estimation accuracy. The estimation accuracies for the morning and evening trips were high, and the estimation accuracy for the daytime off-peak hours was low. This means that the morning and evening trips were more likely to follow a pattern than the daytime trips. Fig 12A shows the estimation accuracy based on the distance travelled. The estimation accuracies were relatively higher for the distance of 10-15 km and remained constant up to 20 km. When the distance exceeded 20 km, the accuracies slightly decreased. Fig 12B presents the estimation errors according to the distance travelled. The numbers represent the averages of the sequential differences between the estimated and actual stops (e.g., 0 if correctly estimated). The error increased with the travel distance. Therefore, the developed method achieved a relatively good performance for trips with a travel distance of less than 20 km.

Improvement of the current algorithm
The purpose of this study was to improve the destination estimation method of unlinked trips by referring to the historical travel patterns of the passengers. To generate a set of unlinked trips, the destination should first be estimated using the trip chain method. Trips that cannot be estimated by the chain method form a set of unlinked trips. We used the method proposed by Barry et al. [4] and applied the allowable walking distance of 500 m. Among the chosen potential destinations, the stop with the shortest generalization distance was estimated as the destination of the trip. The generalized distance [16] was calculated by adding the generalized transfer distance to the in-vehicle distance from the boarding stop to the potential stop z (d z ). The generalized transfer distance was the product of the walking resistance (f w , 1.5 in this study), walking-to-bus speed conversion factor (c b−w , 5 in this study), and distance d z−post from the potential stop z to the next boarding point.
We improved the existing trip chain-based destination estimation method by applying the travel patterns to the unlinked trips, which cannot be estimated by the trip chain method. Table 2 shows the results of the destination estimation by applying the travel pattern method of this study to the set of unlinked trips generated by employing the trip chain method to the validation data (116,194 trips) in the previous section.
When the current trip chain method was applied, the destinations of 60.0% of the trips could be estimated (matched) with the estimation accuracy of 68.8%. With respect to the total number of sample trips rather than the matched number of trips, the estimation accuracy was 41.2%. However, 46,532 trips (40.0%) were generated as unlinked trips without destination matching. By applying the travel pattern method proposed in this paper to these unlinked trips, 37.2% of the destinations could be matched, and the estimation accuracy of the matched trips was 46.6%.
When the destination was estimated by combining the trip chain method and travel pattern method, the matching percentage was improved from 60.0% to 74.9%. Although the estimation accuracy for the matched trips decreased, that for the total trips improved from 41.2% to 48.2% because the number of trips matched with high accuracy increased. As a result, accurate estimation was realized for 48.2% of the total trips and 65.6% of the trips within one stop difference. Therefore, the proposed method of estimating the destinations using the travel patterns and historical boarding records can improve the destination estimation problem of unlinked trips.

Conclusions
In this paper, we proposed a longitudinal method that applies the temporal travel patterns and historical boarding records of public transportation passengers obtained from long-term smart card data to estimate the destinations of unlinked trips, which could not be analyzed using the trip chain method. Travel profiles were generated using the temporal frequency of the travel for each passenger. The passengers were clustered by k-means clustering. The time-of-day travel patterns were estimated for each passenger cluster using a GMM. The time sections were partitioned with the temporal characteristics of the trips, following the GMM results. Potential destinations were created to apply the temporal travel patterns to the destination estimation by aggregating historical boarding locations (stops) for each time section. Finally, a stop with high historical frequency of boarding, selected among the potential destinations in the time section to which the trip does not belong, was matched as the destination of the trip.
We applied this method to the smart card system of Sejong City, which contained alighting information that could be used for verification. As a result, 44.0% of the trips were matched. For 48.4% of the destination-matched trips, the destination was accurately estimated to the actual alighting point. Compared to existing destination estimation methods using smart card data using the trip chain method only, the proposed travel pattern method applied to the unlinked trips generated from the trip chain method increased the matching percentage by 14.9%. The number of trips with their destinations matched within the difference of one bus stop from the actual alighting stop also increased by 6.9% compared with the existing trip chain method. Therefore, the proposed method of estimating the destinations using travel patterns and historical boarding records could improve the destination estimation for unlinked trips. However, the proposed method has some limitations. First, as the proposed method assumes a travel pattern for each passenger, the destinations of the passengers who use public transportation less frequently cannot be estimated. In particular, the destinations of the passengers who used public transportation for less than four out of 38 days, which comprised 24.6% of the total trips, could not be estimated. Second, because the destination is estimated by referring to the potential destinations belonging to the time sections other than the time section to which the trip belongs, the destinations of the trips in the clusters with a travel pattern composed of only one Gaussian cannot be estimated; the proportion of such cases was approximately 26.3%.
In the future, it will be necessary to verify whether the possibility of generating the travel patterns of the passengers can be increased by longitudinal expansion of data and to develop an estimation algorithm for trips belonging to a single Gaussian pattern. Moreover, we need to explore the developed models by considering the correlation between the spatio-temporal characteristics, including the changes in the travel patterns of the passengers over time and actual transfer distance from the pedestrian network. The improved method is expected to achieve a higher matching rate for the O-D matrix of public transportation. Although we have generated discrete travel profile data, studies that apply a continuous clustering method to continuous time data are also needed to accurately understand the travel behaviors.